unzip: the inverse of zip #33515
Conversation
It seems better to return a tuple. Maybe something like […]

Of course the downside to that is it uses more temporary space. We could optimize it later by hooking in to the […]
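A minimal eager sketch of the tuple-returning idea (hypothetical code, not this PR's implementation; it assumes a non-empty iterator of equal-length tuples and does no element-type widening):

```julia
# Hypothetical sketch of an eager, tuple-returning unzip.
# Assumes `itr` is non-empty and every element has the same length;
# errors if a later element's type doesn't convert (no widening).
function unzip_simple(itr)
    first_el, rest = Iterators.peel(itr)
    vecs = map(x -> [x], Tuple(first_el))  # one vector per slot
    for el in rest
        foreach((v, x) -> push!(v, x), vecs, el)
    end
    return vecs
end
```

The "more temporary space" concern is visible here: each slot gets its own growing vector rather than sharing one backing store.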
I've really enjoyed using https://github.com/piever/StructArrays.jl/ and seeing it mature, FWIW.

Definitely. If we wanted to add things to Base, StructArrays would be on my short list. It's arguably the most useful general "shape" of container that's not really provided by Base.
It was simple to change this implementation to return a tuple: just convert it at the end. Since one expects the number of collections returned to be small, this should be quite cheap.

I think this is about the most straightforward eager […]. The potential alternative that makes sense to me would be to have […]
One possible variation would be to call […]:

```julia
julia> unzip([[1, "apple"], [2.5, "orange"], [0, "mango"]])
([1.0, 2.5, 0.0], ["apple", "orange", "mango"])
```
How would the performance of this compare to my version?

I haven't compared the performance, but if the goal is to get a tuple of unzipped vectors, this ought to be about as fast as possible since it doesn't do any extraneous work. Do you have a test case that you've been doing benchmarking on?
The performance issues I'd expect here would be the lack of type info in […]

True. I could rewrite it so that it collects into a tuple and replaces the whole tuple if the eltype of any of the slots needs to change, with the usual recursive trick.
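The "usual recursive trick" is typically written something like this (a hypothetical sketch, not the PR's actual code): an inner loop that is type-stable in the current vector types and restarts itself through a function barrier whenever an element requires widening.

```julia
# Hypothetical sketch of the widening-recursion trick. The inner loop is
# type-stable in `vecs`; when an element needs a wider type, we copy into
# widened vectors and recurse, so the compiler specializes a fresh loop.
function _unzip_into!(vecs, itr, state)
    y = iterate(itr, state)
    while y !== nothing
        el, state = y
        if all(((v, x),) -> promote_type(eltype(v), typeof(x)) === eltype(v), zip(vecs, el))
            foreach((v, x) -> push!(v, x), vecs, el)
        else
            newvecs = map(vecs, Tuple(el)) do v, x
                T = promote_type(eltype(v), typeof(x))
                newv = Vector{T}(v)   # copy with widened element type
                push!(newv, x)
            end
            return _unzip_into!(newvecs, itr, state)  # function barrier
        end
        y = iterate(itr, state)
    end
    return vecs
end

function unzip_widen(itr)
    el, state = iterate(itr)          # assumes a non-empty iterator
    vecs = map(x -> [x], Tuple(el))
    return _unzip_into!(vecs, itr, state)
end
```

Recursion depth is bounded by the number of distinct widenings, not the input length, since we only recurse when a slot's element type strictly grows.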
Here's a version that does the widening recursion trick. This should be fast in the case where […]

You're doing that anyways, because of the call to […]

My main thought was to avoid any dynamic dispatch on each loop iteration and only do it once before the loop. I also figured that if the compiler knows about the length and type of each slot in […]

Neither not using […]
Using the following test data:

```julia
itrs1 = collect(enumerate(eachline("/usr/share/dict/words")))
itrs2 = map(t->[t...], itrs1)
itrs3 = collect(enumerate(rand(1:100, 10^6)))
itrs4 = map(t->[t...], itrs3)
```

Benchmark results on this branch:

```julia
julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  10.78 MiB
  allocs estimate:  470760
  --------------
  minimum time:     14.655 ms (0.00% GC)
  median time:      16.224 ms (0.00% GC)
  mean time:        16.807 ms (3.20% GC)
  maximum time:     23.046 ms (29.50% GC)
  --------------
  samples:          298
  evals/sample:     1

julia> @benchmark unzip($itrs2)
BenchmarkTools.Trial:
  memory estimate:  147.54 MiB
  allocs estimate:  4715666
  --------------
  minimum time:     522.516 ms (1.65% GC)
  median time:      546.774 ms (3.06% GC)
  mean time:        563.799 ms (2.50% GC)
  maximum time:     691.221 ms (1.36% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  61.03 MiB
  allocs estimate:  1999495
  --------------
  minimum time:     29.845 ms (0.00% GC)
  median time:      38.098 ms (16.04% GC)
  mean time:        36.911 ms (12.67% GC)
  maximum time:     44.987 ms (20.47% GC)
  --------------
  samples:          136
  evals/sample:     1

julia> @benchmark unzip($itrs4)
BenchmarkTools.Trial:
  memory estimate:  564.54 MiB
  allocs estimate:  17997947
  --------------
  minimum time:     1.594 s (3.17% GC)
  median time:      1.604 s (3.55% GC)
  mean time:        1.616 s (3.69% GC)
  maximum time:     1.661 s (4.44% GC)
  --------------
  samples:          4
  evals/sample:     1
```

Benchmark results on #33324:

```julia
julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  3.60 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.094 ms (0.00% GC)
  median time:      2.678 ms (0.00% GC)
  mean time:        2.573 ms (11.13% GC)
  maximum time:     12.254 ms (70.55% GC)
  --------------
  samples:          1938
  evals/sample:     1

julia> @benchmark unzip($itrs2)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(235886)]

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  15.26 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.454 ms (0.00% GC)
  median time:      2.490 ms (0.00% GC)
  mean time:        3.486 ms (28.96% GC)
  maximum time:     12.152 ms (56.13% GC)
  --------------
  samples:          1432
  evals/sample:     1

julia> @benchmark unzip($itrs4)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(1000000)]
```

So @bramtayl's version does have really impressive performance in the inferrable case, but it fails in the non-inferrable case. I guess that means that this code can stand some optimization.
Basically, it comes down to being tricky to convince the compiler to unroll the inner unzip loop even though it knows the length and type of the value tuple and the tuple of vectors to write into.
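One standard way to force that unrolling (again, a hypothetical sketch rather than this PR's code) is to recurse on the tuple structure itself instead of looping over indices, so each slot's `push!` becomes a separate, statically-typed call:

```julia
# Hypothetical sketch: recursing on tuple structure instead of looping
# lets the compiler emit one statically-typed push! per slot.
push_slots!(vecs::Tuple{}, el::Tuple{}) = nothing
function push_slots!(vecs::Tuple, el::Tuple)
    push!(first(vecs), first(el))
    push_slots!(Base.tail(vecs), Base.tail(el))
end
```

Because both arguments are tuples of known length and element types, each recursive level is a distinct method specialization with no dynamic dispatch in the loop body.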
I'm going to give this a friendly bump. It would be nice to have in 1.7.
If we put this in Base, I do think it should be inferable and as fast as possible, especially since some good implementations (for particular cases) are already available in packages. If I'm reading it right, the big issue with the implementation in #33515 (comment) is that it does multiple passes over the iterator, and so is not general enough. I will try to experiment with the StructArrays kind of approach I was talking about above.

https://github.com/bramtayl/Unzip.jl exists and is registered (I think it's pretty performant, but I'm sure there's ways to speed it up more)

Yes, that's the approach I'm talking about. Unzip.jl seems quite good --- fast, infers, and no generated functions. StructArrays.jl does the same thing as well, but it is astonishing how different the code is. It uses some generated functions, and also re-implements some of the Base […]
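For reference, the StructArrays route can be sketched roughly like this (assuming the current StructArrays.jl API; `components` was called `fieldarrays` in older versions, and the exact entry point may differ):

```julia
using StructArrays  # external package, not part of Base

# Hypothetical sketch: collect the iterator column-wise into a StructArray,
# then return its per-slot backing arrays.
unzip_sa(itr) = StructArrays.components(StructArrays.collect_structarray(itr))
```

The heavy lifting (single pass, element-type widening) lives inside the package's collection machinery rather than in this one-liner.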
Well, […]

I am a proponent of allowing Tuples and NamedTuples interchangeably.
I noted this in an issue at Unzip, but I should note that my main use-case is when I do […]

i.e. I use […]

Agreed it might be nice for it to work with tuples or namedtuples. What would be even cooler would be to merge namedtuples and tuples into the same thing, where you could refer to items by name if they have it, or else just by number. Something like […]
That design would certainly have its own trade-offs; I don't think it's universally better.

Out of curiosity, what would be the downsides?

While it may make sense to have […]

Allowing unnamed entries within NamedTuples seems a confusing tack to me.

Hmm, well, I guess it's a mostly unrelated issue. But if we want an unzip that works with NamedTuples and Tuples, it would make it a lot easier if Tuples and NamedTuples were... less different?
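For what it's worth, the overlap already exists in one direction: NamedTuples support positional indexing and convert losslessly to plain Tuples; it's only the reverse (unnamed entries inside a NamedTuple) that has no representation today.

```julia
nt = (a = 1, b = 2)
@assert nt[1] == 1           # positional indexing already works on NamedTuples
@assert nt[:b] == 2          # ...alongside indexing by name
@assert Tuple(nt) == (1, 2)  # lossless conversion to a plain Tuple
```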
Implementing a really basic […]
bump! can we get tests and news so this can be closed?
I added some tests, NEWS, and a compat annotation, and fixed a bug due to a missing import. There is currently a test failure for the case of an empty iterator of iterators, where the behavior of returning […]:

```julia
julia> itrs = ((), ())
((), ())

julia> collect.(itrs)
(Union{}[], Union{}[])

julia> unzip(zip(itrs...))
()
```

Note that this also makes […]:

```julia
julia> unzip(Tuple{Int8,Bool}[])
()

julia> unzip(Tuple{Int8,Bool}[(3,true)])
(Int8[3], Bool[1])
```

I know that the empty-iterator case is notoriously tricky, but can we do something better at least in the case where the […]
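When the eltype is a concrete Tuple type, the number and element types of the outputs are statically known even for an empty input, so something along these lines seems doable (a hypothetical sketch of the empty-input branch, not the PR's code):

```julia
# Hypothetical sketch: for an empty input whose eltype is a concrete
# Tuple type, return the right number of appropriately-typed empty vectors.
function unzip_empty(itr)
    T = eltype(itr)
    if T <: Tuple && isconcretetype(T)
        return ntuple(i -> fieldtype(T, i)[], fieldcount(T))
    end
    return ()  # no type information: fall back to the ambiguous empty tuple
end
```

With this branch, `unzip(Tuple{Int8,Bool}[])` could return `(Int8[], Bool[])` instead of `()`, matching what `collect` of each slot would give.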
Closes #13942. Alternative to #33324, with apologies to @bramtayl, who did a lot of work on unzip.